Solution: Use Columns Unambiguously

The sections that follow describe several ways we can resolve this antipattern and write unambiguous queries.

Query only functionally dependent columns#

The most straightforward solution is to eliminate ambiguous columns from the query.

Retrieving the latest reported date of a bug by using a straightforward solution

The query reveals the date of the latest bug per product, even though it doesn’t report the bug_id corresponding to that latest bug. Sometimes this is enough, and we shouldn’t overlook this simple solution.

Using a correlated subquery#

A correlated subquery contains a reference to the outer query and produces different results for each outer query row. We can use this to find the latest bug per product by running a subquery to search for bugs with the same product and a greater date. When the subquery finds none, the bug in the outer query is the latest.

Retrieving the latest reported date of a bug using correlated subquery

This is a simple solution that is readable and easy to code. However, we must remember that this solution isn’t likely to be the best for performance because correlated subqueries are executed once for each row of the outer query.

Using a derived table#

We can use a subquery as a “derived table”, producing an interim result that contains only the product_id and the corresponding greatest bug report date for each product. Then, we can use this result to join against the tables so that the query result contains only bugs with the latest date per product.

Retrieving the latest reported date of a bug using a derived table

Under this solution, we can get multiple rows per product if the latest date returned by the subquery matches multiple rows. If we need to ensure a single row per product_id, we can use another grouping function in the outer query:

Retrieving the latest reported date of a bug using a derived table while ensuring a single row per product

We can use the derived table solution as a more scalable alternative to the correlated subquery. The derived table is non-correlated, so most database brands should be able to execute the subquery once. However, the database must store the interim result set in a temporary table, so this solution still isn’t the best for performance.

Using a JOIN#

We can create a join that tries to match against a set of rows that may not exist. This type of join is called an outer join. Where the matching rows don’t exist, NULL is used for all columns in that nonexistent row. So, where the query finds NULL, we know that no such row was found.

Retrieving the latest reported date of a bug using a JOIN

It takes a few minutes of gazing at this query, and perhaps some doodles on notepaper, for most people to see how it works. But once we do, this technique can be an important tool.

We use the JOIN solution when the scalability of the query over large sets of data is important. Although it’s a tougher concept to grasp and therefore more difficult to maintain, it often scales better than a subquery-based solution. We must remember to measure the performance of several query forms instead of assuming that one performs better than the other.

Using an aggregate function for extra columns#

We can make the extra column comply with the Single-Value Rule by applying another aggregate function to it.

Retrieving the latest reported date of a bug using another aggregate function

We use this solution only when we can rely on the latest bug_id being the bug with the latest date, that is, if the bugs are guaranteed to be reported in chronological order.

Concatenating all values per group#

Finally, we can use another aggregate function on bug_id to avoid violating the Single-Value Rule. MySQL and SQLite support a function GROUP_CONCAT() that concatenates all the values in the group into one value. By default, this is a comma-separated string.

Concatenating all values per group using GROUP_CONCAT()

This query doesn’t reveal which bug_id corresponds to the latest date; the bug_id_list includes all bug_id values in each group.

Another disadvantage of this solution is that it isn’t standard SQL, and other brands of databases don’t support this function. Some brands of databases support custom functions and custom aggregate functions. For example, here’s the solution for PostgreSQL:

/
main.sql
Custom aggregate functions in PostgreSQL for concatenating values

Some other brands of database don’t support custom functions, so the solution may require writing a stored procedure to loop over a non-grouped query result, concatenating values manually.

We can use this solution when we expect the extra column to have a single value per group but the column still violates the Single-Value Rule.

Antipattern: Reference Non-grouped Columns
Synopsis: Random Selection
Mark as Completed
Report an Issue